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Regression models with functional responses and covariates constitute a powerful and 
increasingly important model class. However, regression with functional data poses well 
known and challenging problems of non-identifiability. This non-identihability can manifest 
itself in arbitrarily large errors for coefficient surface estimates despite accurate predictions 
of the responses, thus invalidating substantial interpretations of the htted models. We offer 
an accessible rephrasing of these identifiability issues in realistic applications of penalized 
linear function-on-function-regression and delimit the set of circumstances under which 
they are likely to occur in practice. Specifically, non-identifiability that persists under 
smoothness assumptions on the coefficient surface can occur if the functional covariate’s 
empirical covariance has a kernel which overlaps that of the roughness penalty of the 
spline estimator. Extensive simulation studies validate the theoretical insights, explore 
the extent of the problem and allow us to evaluate their practical consequences under 
varying assumptions about the data generating processes. A case study illustrates the 
practical significance of the problem. Based on theoretical considerations and our empirical 
evaluation, we provide immediately applicable diagnostics for lack of identifiability and give 
recommendations for avoiding estimation artifacts in practice. 

1. Introduction 

The last two decades have seen rapid progress in regression methodology for high-dimensional 
data, largely driven by applications to genomic data in the “small n, large p” paradigm. 
In regression models with functional predictors, similar problems arise from the fact that 
covariate information comes in the shape of high-dimensional, strongly auto-correlated vec¬ 
tors of function evaluations. Whenever the number of regression parameters to estimate 
exceeds the number of observations, estimates are not unique and the resulting model is 
not identifiable. To overcome this lack of identifiability, it then becomes necessary to use 
heuristics or prior knowledge to impose additional structural constraints like sparsity or 
smoothness. Results then inherently depend - at least to some degree - on the assump¬ 
tions underlying the chosen regularization method. In the following, we present a detailed 
analysis of the way in which smoothness assumptions combined with properties of the data 
generating process affect estimation results for function-on-function regression. We differ¬ 
entiate between two kinds of non-identihability: First, simple non-identifiability arising 
from low information content in the functional data. This can be cured by imposing struc¬ 
tural assumptions like smoothness or sparsity on the estimators, i.e., by regularization of 
the estimators. Fig. I shows an example for 4 estimates under different structural assump¬ 
tions all yielding identical hts in such a scenario. Second, persistent non-identifiability that 
remains despite regularization for certain combinations of models and data. Figures 4 and 
8 show examples for the latter on synthetic and real data, respectively. 
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The problem of - especially persistent - non-identifiability is as yet under-appreciated 
in the functional data literature and analyzed here in depth for the first time. As software 
capable of fitting increasingly complex models with functional data becomes available (e.g. 
fda, Ramsay et al. (2014); fda.usc, Febrero-Bande and Oviedo de la Fuente (2012); refund, 
Huang et al. (2015); PACE, Yang et al. (2012); WFMM, Herrick (2013)), investigating 
the practical relevance of identifiability issues arising in these models is both timely and 
important in this rapidly developing field. The present work aims to perform such an 
investigation for the model class described in Scheipl et al. (2015) and implemented in 
refund’s pffr function, while results carry over to other penalized function-on-function 
regression approaches such as those implemented in the fda package. 

A popular approach in regression for functional data restricts the functional coefficients 
to lie in the span of the first K < n estimated eigenfunctions of a functional covariate’s 
covariance operator with the largest eigenvalues, (see Cardot et ah, 1999, 2003; Yao et ah, 
2005; Reiss and Ogden, 2007; Yao and Miiller, 2010; Wu et ah, 2010, for example). This 
functional principal component regression (FPCR) approach solves the problem of over¬ 
parameterization (i.e., non-identifiability of the functional effect) by a - usually drastic 
- dimension reduction. The main challenges in this approach then become 1) achieving 
good estimates of the covariance’s eigenfunctions (“functional principal components” or 
FPCs), eigenvalues, and FPC scores from observed functional data and 2) choosing the 
regularization parameter K. In practice, the effect of the functional covariate is estimated 
by using the first K estimated FPC scores as synthetic covariates. However, the critical 
assumption that the true coefficient lies in the span of the first few empirical eigenfunctions 
of a suitable (cross-)covariance operator estimate is impossible to verify empirically. Due 
to the often wiggly and unsmooth nature of eigenfunctions of real data this assumption 
can also lead to estimates that are difficult to interpret or implausible to practitioners. 

An alternative approach is to make assumptions on the functional coefficients informed 
by insights into the problem at hand, e.g., sparsity or smoothness of functional coeffi¬ 
cients, and to estimate these functional coefficients subject to an appropriate penalty (e.g., 
LASSO or smoothness penalties). In this work, we will focus on smooth spline-based pe¬ 
nalized regression models for functional responses with functional covariates as described 
in Ivanescu et al. (2015) and Scheipl et al. (2015), which constitute a powerful and flex¬ 
ible model class able to deal with the wealth of functional data increasingly collected in 
many fields of science. Nevertheless, our considerations carry over to other approaches to 
estimate smooth coefficient functions, such as approaches using derivative-based penalties 
as advocated by Ramsay and Silverman (2005). This paper describes the data settings in 
which penalized models remain unidentifiable despite the penalty in Section 3 and develops 
and evaluates suitable diagnostics and modified penalties for such settings. 

Identifiability issues in functional regression have previously been discussed in Cardot 
et al. (2003) in the context of functional regression models for scalar responses and also, 
briefly, in the context of models with both functional responses and functional predictors 
by He et al. (2000), Chiou et al. (2004) and Prchal and Sarda (2007). While results therein 
provide conditions for the theoretical existence and unicity of solutions based on functional 
analysis arguments, they do not yield empirically verifiable criteria to determine whether 
the conditions for unicity are violated for a given data set. They also always assume that 
the true coefficient surface lies in the space spanned by the eigenfunctions of a (cross- 
)covariance operator. As far as we are aware, case studies in the previous literature have 
implicitly assumed that this assumption and the necessary conditions based on it will be 
satisfied for observed data. This is problematic since 1) the theoretical conditions found 
in the previous literature are impossible to satisfy, or at least verify, on finite samples of 
functional data in finite resolution, and 2) our experience with applications of functional 
regression models as well as results from simulation studies indicate that persistent non- 
identifiability leading to spurious coefficient estimates may occur quite regularly. This 
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is obviously a concern for applied statisticians desiring interpretable regression models 
associating functional covariates and (functional) responses. 

Instead of relying on the functional analysis arguments suitable for investigations of 
asymptotic properties of the theoretical model, we use simple linear algebra to derive a 
condition for unicity of coefficient surface estimates in realistic, finite sample data settings 
in which functional covariates are observed with finite resolution in Section 3. This allows 
us to give a necessary and sufficient condition for persistent non-identifiability in penalized 
function-on-function regression that is empirically verifiable and thus applicable in realistic 
problems. The criterion is based on the amount of overlap between the kernel of the penalty 
matrix and the kernel of the design matrix for the functional effect. Simulation studies 
indicate that, in practice, severe errors due to non-identifiability are strongly associated 
with our criterion; thus, this criterion is the first one that can be used in a wide variety 
of applications to assess identifiability or lack thereof. Our analyses also indicate that 
many widely used preprocessing techniques for functional data which replace observed 
curves with spline-based or FPC-based low-rank approximations (i.e., pre-smoothing) or 
the centering of individual observed curves will considerably increase the likelihood of 
identifiability issues in many settings. 


Splines, 2nd DifF. Penalty 


Splines, 1st Diff. Penalty 


Splines, Ridge Penalty 


FPC-based 






Figure 1: Four coefRcient surfaces for regressing RCST-FA on pre-smoothed CCA-FA truncated to its first 6 
empirical FPCs. All lead to identical fitted values. First [second] panel: Tensor product spline-based 
fits (12 marginal cubic B-spline basis functions) with second [first] order difference penalty. Third 
panel: Tensor product spline-based fit with a Ridge penalty. Fourth panel: FPC regression result 
(6 FPCs). All fits performed with pffrO. 


We use the well-known dti data set - publicly available in R-package refund (Huang 
et ah, 2015) - to illustrate the issues we discuss here. These data contain spatially in¬ 
dexed, i.e., functional, measurements of fractional anisotropy (FA), a proxy variable for 
neuronal health, along 3 cerebral white matter tracts (WMTs) of multiple sclerosis (MS) 
patients. To illustrate how strongly different structural assumptions about the regression 
coefficient surface can affect the results, we fit simple univariate functional linear mod¬ 
els E{Y{t)) = /3o(^) + f X{s)(d{s,t)ds regressing FA along the right cortico-spinal WMT 
(RCST-FA, Y{t)) on pre-smoothed FA along the corpus callosum WMT (CCA-FA, X(s)). 
Figure 1 shows estimated coefficient surfaces /3(s, t) for penalized spline based fits with sec¬ 
ond or first order difference penalties (two leftmost panels) or a ridge penalty (third panel 
from left), as well as the coefficient surface implied by a functional principal component 
(FPC) based ht (right). Even though the surfaces are quite different, they result in (prac¬ 
tically) identical fitted values in this example for a setting with simple non-identifiability. 
The first and third panels from the left in the top row of Figure 7 show the data used in this 
example. Graphical examples of the second kind of non-identifiability that persists even 
under penalization are shown in Figures 4 and 8 for synthetic and real data, respectively. 

Our paper is structured as follows: Section 2 defines the model and data structure under 
discussion. We present an accessible rephrasing of the fundamental issue (Section 3) and 
derive necessary and sufficient conditions for settings in which f3{s,t) is or is not identihable 
in Section 3.2. Section 4 follows up with an analysis of the scope of the problem based 
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on simulated data, while Section 5 describes a real-world example of the issue. The main 
conclusions we draw from our analysis are that the complexity of observed functional 
covariates puts hard limits on the identifiability of coefficient surfaces in a number of ways, 
that these limits can be diagnosed based on the data at hand, and that pre-processing of 
functional covariates can often exacerbate identifiability issues. 

2. Model and Data Structure 

In the following, bold symbols denote vectors and matrices, and calligraphic letters denote 
function domains, function spaces, or sets of functions or vectors. 

Define a simple function-on-function regression model as 

Yi{t) = j Xi{s)/3{s,t)ds + Eit, (1) 

where Ti(t) and Xj(s), i = are functional responses and covariates on closed 

intervals T and S in M, respectively, and assume that they are realizations of zero-mean 
square integrable stochastic processes Y{t) G L‘^[T] and X{s) G L‘^[<S] with continuous 
covariance functions, respectively. To simplify notation and exposition but without loss 
of generality, we assume that E{Yi{t)) = 0 and E{Xi{s)) = 0. Propositions 3.1 to 3.3 in 
Section 3 directly carry over to any model with an additive predictor that includes terms 
like Xi{s)l3{s,t)ds. The further development in Section 3 leading up to Proposition 
3.4 assumes that the estimate for /3(s,t) minimizes a (penalized) quadratic loss function, 
which is equivalent to maximizing the likelihood of a model with i.i.d. Gaussian errors Eu, 
but does not strictly speaking depend on distributional assumptions about Eu and is also 
very similar to the system of equations solved in each iteration of the penalized iteratively 
re-weighted least squares algorithm (P-IWLS; Wood, 2000) used to fit additive models like 
(1) for non-Gaussian responses. 

Due to the assumptions on the functional covariates, they can be represented by a 
Karhunen-Loeve expansion 

OO 

^*(■5) = ^ Cimfpmis), ( 2 ) 

m=l 

with orthonormal (pmis), (j)rn{s)4>m'is)ds = 5mm’-, and uncorrelated zero-mean PPG 
scores £,im with variances > 1/2 > • • • > 0, m G N. The Vm and 4>m{s),m = 1,..., M, 
are the ordered eigenvalues and eigenfunctions of the covariance operator of 2 f(s), 
respectively, with the covariance function given by Mercer’s theorem as 

OO 

k^{s, s') = E (X(s)X(s')) = ^ r'm(/>m(s)</>m(s')- (3) 

m=l 

Since estimating /3(s,t) is an inverse problem, some kind of regularization is required. 
Functional principal component based approaches like, for example, Yao et al. (2005) 
restrict /3(s,t) to lie in the span of the first K estimated eigenfunctions (pmis), m = 
1,... ,K for all t. The number of eigenfunctions K that is used serves as the (discrete) 
regularization parameter. In contrast, we will discuss and analyze a penalized approach. 
The underlying assumption is that /3(s, t) is a smooth function that can be well represented 
as a linear combination of suitable basis functions defined on 5 x T. 

In practice, functional responses Yi(t) and functional covariates Xi{s) are observed on 
grid points Sj = (sji, •.., SisJ and U = {tn, ..., tjTj- For simplicity, we assume those to 
be identical vectors s,t with lengths S and T, respectively, for each observation i. In the 
following, expressions a(s) or a{t) with a bold argument denote the vector of evaluations 
of a(-) on the respective grid, e.g., a{s) = (a(si),... , 0 ( 55 ))"''. 
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Model (1) can then be approximated for observed data as 


PS w •Xj(s) /3(s,t)+ Si , (4) 

IXT \^sxl Sxl J SXT 

with P{s,t) = W(sj,tk)] j=i,...,s and Si = ( 6 , 4 .,..., We also define a weight vector 

k=l,...,T 

w for numerical integration, e.g. w = for simple quadrature via Riemann 

sums, with Wj the length of the sub-interval of S represented by sj. The symbol • denotes 
element-wise multiplication. The coefficient surface /3{s,t) is represented using a tensor 
product spline basis 

/3{s,t)= Bs 0 (5) 

SxKg KaXK-t KtxT 

with basis matrices Bg and Bf of Kg and Kt basis functions evaluated in s and t, respec¬ 
tively, and spline coefficient matrix 0. The roughness penalty matrix for the surface is 
given by P = P{Xs,Xt) = Xg{Pg ® Ixt) + ® Pt) (Wood, 2006), where Xg,Xt are 

smoothing parameters to be estimated from the data and Pg and Pt are the fixed and 
known marginal penalty matrices for the s- and t-directions, respectively. 

Estimation and inference is described in more detail in Ivanescu et al. (2015) and Scheipl 
et al. (2015). In the following, our considerations are not limited to simple models such as 
model ( 1 ), but carry over to more general models Yi{t) = rji{t) + Xi(s)id(s, t)ds + eu by 
using Yi{t) = Yi(t) — rji{t). The additive predictor rii{t) represents the sum of other terms 
in the model such as a global functional intercept index-varying linear or smooth 

effects of scalar covariates x like Xi/3{t) or f{xi,t), scalar or functional random effects, 
etc. Scheipl et al. (2015) contains methods and applied examples for this class of flexible 
additive functional regression models. Of course, these more general models may suffer 
from additional identifiability problems caused by collinearity or concurvity of the terms 
in the additive predictor which are outside the scope of this paper. While we focus our 
discussion on a spline-based approach, the representation in (5) also accommodates other 
choices of basis functions and penalties. 

3. Identifiability 

In this section, we discuss potential sources of non-identifiability in model (1). The first 
subsection restates some known results on these issues for the theoretical model ( 1 ) with 
truly functional observations Xi{s) and Yi{t). Subsection 3.2 then discusses identifiability 
for the finite resolution vector data available in practice. 

3.1. Identifiability in the theoretical model 

It is well known (e.g. Prchal and Sarda (2007, c.f. p. 5), He et al. (2000, Th. 4.3. c)) 
that coefficient surfaces are identifiable only up to the addition of functions in the ker¬ 
nel of , i.e., if fulfills model ( 1 ), so does + f3K{-,t) for any f3K{-,t) with 

{s,v)l3K{v,t)dv = 0 for all s,t. Thus, we have identifiability only when the kernel is 
trivial. 

Proposition 3.1. The coefficient surface f3(s,t) in (1) is identifiable if and only if 

ke{K^) = { 0 }. 

An important secondary consequence is that predicted responses Xi(s)l3{s, t)ds can be 
entirely unaffected by large changes in /3(s, t). Thus, strategies for detection of identifiabil¬ 
ity problems cannot be based on predictive performance in cross-validation, bootstrapping 
and related methods. 
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Non-identifiability in Proposition 3.1 occurs when 'ke{K^) is non-trivial, when the eigen¬ 
functions in (2) with non-zero eigenvalues Vm do not span the L‘^[S]. While it is possible to 
assume a trivial kernel in theory (e.g. Prchal and Sarda, 2007, equation (4)), in practice, 
functional covariates are observed on a hnite number of grid points S and the empirical 
covariance for N observations thus can have at most min(A^, S) non-zero eigenvalues. As is 
exploited in functional principal component analysis (e.g. Ramsay and Silverman, 2005), 
functional observations are often simple enough to be represented accurately by a rela¬ 
tively small number of eigenfunctions, with eigenvalues of higher order small compared to 
noise or measurement error. It is also wide-spread practice to use pre-smoothed versions 
of observed functional covariates as inputs for models like (1) (e.g. James, 2002; Ramsay 
and Silverman, 2005), and these will have a non-empty kernel since they are represented as 
linear combinations of a limited number of basis functions. Basis function representations 
of A(s) are also used when sparsely or incompletely observed functional covariates have to 
be imputed on a grid of s-values to be used as inputs for model (1) (e.g. Goldsmith et ah, 
2011 ). 

In the following section, we thus investigate identifiability problems for finite-sample 
finite-resolution functional data and the interplay between the rank of the observed covari¬ 
ance, the rank of the basis used to represent /3(s,t) in s-direction and the penalty used in 
the penalized estimation approach for /3(s,t) introduced in Section 2. 

3.2. Identifiability in practice 
Rank-deficient design matrix 

In practice, is represented as a linear combination of a finite number KgKt of basis 

functions, see (5). For the following, we will assume that the corresponding approximation 
error is negligible and that a suitably flexible basis has been chosen for /3(s, t). Combining 
(4) and (5), we can write the model as 

Y=XWBs&Bj+e, (6) 

NxT NxS SxS SxKs KsXKt K^xT N xT 

where 1" = [Yi{tj)]i=i,...,N , X = [Ai(s«)] i=i,....]v , W = diag(m) and £ = leuA . Using 

yec{ACB) = (S^ (g) A) vec(C') yields 

vec(l") =[ Bt ®{X W Bs )]vec(©) +vec(e) . (7) 

JVTxl TxK^ NxS SxS SxKs KfKsXl NTxl 

In the linear regression model (7) for y = vec(Y), the parameter vector 0 = vec(©) is 
identifiable if and only if the design matrix D = Bt® [XWBg) is of full column-rank. 

The rank of D is equal to rank(Z)) = rank(St) rank(AiVUSs). Bt will typically be of 
full rank Kt as long as Kt < T, as the Kt spline functions form a basis and the columns 
of Bt are thus linearly independent for non-pathological cases. For X, let W = be 
the empirical version of the Karhunen-Loeve expansion (2), where X~^X = with 

an S X M orthonormal matrix of eigenvectors, M = rank(X), A = diag(Ai,..., Xm) a 
diagonal matrix of ordered positive eigenvalues and H containing M columns of estimated 
scores with empirical variances Ai,..., Xm- Then, by construction, the matrix XWBg is 
at most of rank min(A, M, Kg, S) = min(M, Kg), since M < min(A, S). 

We then have the following proposition: 

Proposition 3.2. Assume that Bt is of full rank Kt- Then, the design matrix D = 
Bt ® (XWBg) in model (7) is rank-deficient if and only if 

a) M < Kg or 

b) ifM> Kg, but ranki^WBg) < Kg . 
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Proof. See Appendix A.l. 


□ 


Proposition 3.2 yields a direct criterion to check for rank-deficient design matrices. Case 
a) corresponds to a low-rank covariance for the A-process. In this case, the functional 
predictor does not carry enough information, as measured by the number of eigenfunctions 
4>m{s) with non-zero eigenvalues, compared to the number of parameters to estimate. 
Case b) means that even if M > Kg, non-identifiability can occur if the span of the basis 
used for /3{s,t) in s-direction contains functions in k.e{K^), as measured by numerical 
integration using the integration weights w. More intuitively, this means that the basis 
for the s-direction of f3{s,t) accommodates modes of variation orthogonal to those of the 
A(s)-process. 

Note that identifiability is determined by the interplay between the complexity of the 
A(s) and the coefficient basis for (3{s,t). Thus, more data will not necessarily resolve 
identifiability issues: A finer grid s will only eliminate identifiability problems present on 
a coarser grid if there is sufficient small-scale structure in A(s) that is also present in the 
basis used for /3(s,t) in s-direction. Increasing the sample size N will likewise eliminate 
problems with identifiability only if the low-rank of the covariate’s covariance is due to 
small sample size. 

The effect of the penalty 

In cases of non-identifiability, the best we can hope for is partial identifiability of the 
parameters in a parameter subspace, i.e., identifiability under additional assumptions on 
the parameters. In this vein, functional principal component regression (e.g. Yao et ah, 
2005) restricts /3(s, t) to lie in the span of the first K eigenfunctions <^m(s), m = 1,..., K, 
for all t. Remaining problems then include the fact that the fmis) are estimated quantities 
in practice, with corresponding measurement error, and the choice of K, which can strongly 
affect the shape of the resulting function estimate (Crainiceanu et ah, 2009). Also, this 
approach couples assumptions on the shape of f3{s,t) to properties of the space spanned 
by the retained 4>mis). In particular, their smoothness determines that of (3{s,t), and 
inclusion of higher-order eigenfunctions often leads to wiggly surface estimates that are 
hard to interpret and unstable under replication. 

Here, we focus on a penalized approach assuming smoothness of /3(s,t) and investi¬ 
gate the effects of the penalty on identifiability. While the use of a penalized approach is 
well-known to avoid identifiability problems due to high correlation between observations 
at neighboring grid points (e.g. Ramsay and Silverman, 2005, Ch. 15.2), the full inter¬ 
play between penalty and identihability is, we believe, not fully understood and under- 
appreciated. 

Consider again the design matrix D = Bt® {XWBg) of rank d with the singular value 
decomposition D = = {Vt®Vs){'Et®'Es){Uj ®Uj), with Vt'EtUj and Vs'SgUj 

the singular value decompositions of Bt and Dg := XWBg, respectively. Let indices 
+ and 0 denote the corresponding sub-matrices obtained by removing columns and/or 
rows corresponding to zero and non-zero singular values, respectively. We assume in the 
following that Bt is of full rank Kt- Then, D = Y+Sl+Ll)/ = (V) ® "14+)(51* ® 'Sg+){Uj ® 



Thus, for any given 0* with D0* =: /, there exists a linear subspace PLf d 
of dimension {KgKt — d) given hy PLf = {6 e : DO = /} = {0* + 9^:6^^ im(LIo)} 

that yields identical fits f. 

If we assume our parameter function to come from a space of smooth functions, we can 
select the smoothest solution on a given hyperplane PLf hy minimizing 9^P6 for a suitable 
penalty matrix P that penalizes roughness of the function parameterized by 9. Simple 
non-identifiability occurs if this minimum is unique, persistent non-identifiability occurs 
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if it is not. We have the following proposition regarding uniqueness of the corresponding 
minimum. 

Proposition 3.3. Let P = Xs{lKt ® Ps) + XtiPt ® Iks); with Pg and Pt positive semi- 
definite matrices. Assume that Bt is of full rank Kt, that rank(Pt) < Kt and that Xg > 
0, Ai > 0. Then, for any f G im(iD) there is a unique minimum min^g-^^ {e^PG} if and 
only ifk.e{DjDs) n ke{Pg) = {0}. 

Proof. See Appendix A. 2 . □ 

The assumption that Pt is of less-than-full rank is natural in the context of derivative- 
based penalties and excludes cases like the ridge penalty Pt = Ixt > which would have the 
same effect as a full-rank penalty Pg = Irs^ even in cases of a kernel overlap between 
DJDg and Pg. A potentially full-rank Pt would change the ’if and only if’ in Propositions 
3.3 and 3.4 below to ’if’. 

Proposition 3.3 shows that in the case of a kernel overlap ke(Z)J Dg )nke{Pg) {0}, 
the additional side condition 6^ PG —)• min does not yield a unique smoothest point on the 
hyperplane and the model remains unidentifiable. On the other hand, if there is no kernel 
overlap, there is a unique smoothest point 0 / G P/ and this unique point has the form of 
a projection of G along the hyperplane. Note that for the ridge penalty P = XIrsKi^ 
obtains the projection onto the image im(iD), which sets the part in the kernel of D^D 
to zero. More generally, a smoothness penalty P generates a projection that may have a 
non-zero component in the kernel of D^D if this yields a smaller overall penalty value. In 
this case of no kernel overlap, we thus have a weak form of identifiability, which guarantees 
that there is a unique smoothest representative on any hyperplane of parameters giving 
the same conditional distribution for Y. 

This characterization, which only requires checking of design matrix and penalty in s- 
direction and not for the full model, carries over to the penalized maximum likelihood or 
least squares estimation problem 

min{||y - DGf + XgG'^ {Ir, ® Pg)G + XtG^{Pt ® IrJG} ( 8 ) 

0 

for some > 0, A^ > 0. 

Proposition 3.4. Assume that Bt is of full rank Kt, that rank(Pi) < Kt and that Xg > 
0,Xt > 0 . Then, there is a unique penalized least squares solution for ( 8 ) if and only if 
ke{D'jDg)nke{Pg) = {0}. 

Proof. See Appendix A.3. □ 

Proposition 3.4 gives a criterion for the uniqueness of the penalized least squares solu¬ 
tion. We can show how the penalty achieves this uniqueness by writing ( 8 ) as a nested 
minimization problem. Here, the outer minimization finds the / = DG with optimal fit to 
the data, and the inner minimization minimizes the penalty term over LLf to obtain the 
smoothest solution for a given level of residual variation. 

min{||?/ - DG\\'^ -\- XgG~^ {Ir^ ® Pg)G + XtG^{Pt ® 

0 

= min min{||y — /|P -|- G^PG s.t. DG = /} 

/eim(D) e 

= , min {||y - DGff + G]PGf} 

f£im(D) 

= min {||y - P+S+u+f + vlulH^PHU+v+ 

where Gj = HU+Vj,. for a given f is uniquely defined as in ( 11 ), with f and 

H = {Ir^r, - Uo{U;^PUor^UjP), if ke(T>7T>,) n ke{Pg) = {0}. 


As ^4-^+ is a matrix of full column rank d, this minimization problem has a unique 
solution that, for given A^, At, balances the fit to the data and smoothness. 

To summarize, in the case of no kernel overlap, i.e., ke(iAj Ds)nke{Ps) = { 0 }, we obtain 
a weaker form of identifiability even when the design matrix D is not of full rank, which 
guarantees that there is a unique smoothest representative on any hyperplane of parameters 
giving the same conditional distribution for y. Then, there will also be a unique solution to 
the penalized estimation problem, which is the smoothest representative on the hyperplane 
of possible solutions with equally good fit. 

In practice. A* and Xt are estimated from the data. We here do not investigate the more 
complex case when Xs,Xt are not fixed. It should also be noted that {D^D + Xs{Ik, ® 
Ps) + Xt{Pt 0 Iks)) may still be close to singular even in cases of no kernel overlap if 
smoothing parameters are very small, with corresponding reduced stability in estimation. 

3.3. Diagnostics, practical recommendations and countermeasures 

In order to safeguard against misleading coefficient estimates in practical applications 
of functional regression, it is necessary 1) to develop empirical criteria for diagnosing 
problematic data settings in which the coefficient function is not identifiable based on 
the available data, or where only the penalty ensures unique estimates, 2) to avoid pre¬ 
processing protocols and tuning parameters that increase the likelihood of identifiability 
issues and 3) to develop improved algorithms for estimating function-on-function effects 
that are less prone to severely misleading estimates in problematic settings. 

Diagnostics We are interested in identifying settings with simple non-identifiability in 
which only the penalty term guarantees the existence of a unique solution. Following 
Proposition 3.2, the most direct approach to do so is to compute the condition number 
of DJDs = (XWBs)'^ XWBs and choose a suitable cut-off (10®, in the following) for 
numeric rank deficiency. 

In addition, propositions 3.3 and 3.4 indicate that a measure of the degree of overlap be¬ 
tween the spans of ke(iAj Dg) and ke(Ps) can be used to detect persistent non-identifiability 
that remains despite the penalization. In our empirical evaluation of such measures, we 
found that a measure for the distance between the spans of two matrices introduced in 
Larsson and Villani (2001), when modified for our setting, showed the most promise as 
the resulting measure is free of tuning parameters and can be computed quickly from the 
data. 

In particular, we modify the original definition of Larsson-Villani in order to accommo¬ 
date two matrices of unequal column numbers. We then define the amount of overlap PIlv 
between the span of two matrices A G B G n > pa,Pb-, by 

f|^^(A, B) = trace(Fj VaVJVb). 

Here, Vz is a matrix containing the left singular vectors of the matrix Z and is thus an 
orthogonal matrix spanning the same column space as Z, Z G {A, B}. It is easy to see 
that this measure is symmetric, {^^y{A,B) = A). Similarly to Theorem 2 in 

Larsson and Villani (2001), one can also show that P|^y(A,P) G [0, min(pA,pB)], with 
the overlap assuming its maximum of va.va.[pA,PB) iff A G im(P) or P G im(A) and its 
minimum of 0 iff A G im(P_L) or S_l G im(A), where the n x (n — pz) matrix Z± is the 
orthonormal complement of Z, Z G {A,B}. 

To measure the degree of overlap between the kernels of DJ Dg and of Pg , we could use 
f]LviiDjDg)B ,Pg±), as the span of {DJDg)± corresponds to the kernel of DJDg and 
the span of Pg± corresponds to the kernel of Pg. In the following, we will however use 
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as this choice obtained slightly better sensitivity and specificity for the detection of prob¬ 
lematic settings in our simulations in Section 4. Moreover, one can easily show that 

ke(r>J Ds) n ke(P,) = {0} 4^ ke(X^X) n {WBsX \ x £ ke(P,)} = {0} 

such that the two formulations address the same question. For W = Is, this measure has 
the interpretation of the overlap between the empirical null-space of the observed X(s) 
process or ke{K^) and Vs±, the space of functions not penalized by the penalty defined 
by Ps, evaluated on the grid given by s. It can be determined quickly and accurately 
before the model is fit. Problematic cases are indicated by overlap measures > 1, as this 
is indicative of an at least one-dimensional snb-space of functions contained in the kernel 
overlap. 

Practical recommendations The theoretical results suggest several recommendations for 
pre-processing of functional covariates and choice of the penalty in practice. 

1. Pre-smoothing of functional covariates is commonly done to remove measurement 
error and/or obtain functions on a common grid (e.g. James, 2002; Ramsay and 
Silverman, 2005; Goldsmith et ah, 2011). If the resulting (effective) rank of the 
smoothed covariate process drops below Kg, this will lead to models that are only 
identifiable through the penalty term. We thus recommend to use a sufficiently large 
number of FPCs and/or spline basis functions if such pre-processing is required. 

As a peculiar consequence of this point it may be preferable in some cases to accept a 
small amount of measurement error-induced attenuation in /3(s, t) based on noisy, un¬ 
processed Al(s) in order to avoid a potentially much larger non-identifiability-induced 
error in /3{s,t) based on low-rank, pre-processed Jh(s). 

2. Curve-wise centering of functional covariates such that ^ for all i is 

sometimes used e.g. in the context of spectroscopy data to remove the optical offset 
(c.f. Fuchs et ah, 2015). Then, constant functions lie in the kernel of K^, ke{K^). 
This is not recommended if a penalty is used that does not penalize constant functions 
(as most difference or derivative-based penalties do) to avoid non-identifiability. 

3. Penalties with larger null-spaces increase the likelihood of a kernel overlap and re¬ 
sulting non-identifiability problems. For difference or derivative-based penalties, for 
example, a penalty penalizing deviations from constant functions (first order differ¬ 
ences or derivatives) would thus be preferable in this sense to higher-order differ¬ 
ences/derivatives. Constant coefficient functions, which then span the penalty null- 
space, correspond to models with the mean over the functions as covariate - which are 
often used by practitioners - and thus also lend themselves to intuitive interpretations. 
In particular, for penalties where constant coefficient functions span the penalty null- 
space, it is straightforward to see that only X-processes that are centered curve-wise 
will result in a kernel overlap (unless smoothing parameters are estimated to be very 
small). Unless curve-wise centering is performed as discussed in 2., such processes 
will typically occur rarely and using first-order difference/derivative penalties should 
thus guard against many if not most serious identifiability issues in practice. 

Countermeasures The third point above suggests modifications of the penalty null-space 
to avoid non-identifiability caused by potential overlap between the penalty null-space Vs± 
and ke{K^). Alternatively, the estimated coefficient surface can be constrained to be 
orthogonal to functions in the overlap of Vs± and ke(A'^). We describe three approaches 
using penalties with empty null-spaces and one constraint-based approach; a systematic 
comparison of their performance is given in Section 4.2. 
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1. The simplest approach is the use of a simple ridge penalty Pg = Irs- However, the 
resulting estimates will typically not have good smoothness properties as the ridge 
penalty is not a roughness penalty in the conventional sense and its bias towards small 
absolute values of /3{s,t) may increase estimation error, cf. Figures 1, 4 and 5. 

2. A second approach uses a modified marginal penalty matrix without null-space along 
the lines of the so-called “shrinkage approach” described in Marra and Wood (2011) 
and originally developed for the purpose of variable selection in generalized additive 
models. Marra and Wood (2011) replace the marginal penalty Pg = T diag(/9 i,..., prJT^, 
with eigenvectors contained in F, eigenvalues pi,, pR^ and rank Kg < Kg, by a full 
rank marginal penalty 

= r diag(pi ,...,Pk^, epp.^,..., epp.^V'^. 

This substitutes the zero eigenvalues pp- , pR^ with epp for all k = Kg + 

l,...,Kg using 0 < e <C 1, and thereby adds a small amount of penalization to 
parameter vectors in the null space of the original penalty. In the following, we will 
refer to such a modified penalty as a full-rank penalty. By imposing a small degree 
of regularization on functions in Vg± one can in principle preserve the attractive 
smoothing properties of the original penalty while still avoiding large artifacts due 
to non-identifiability resulting from a kernel overlap. We use e = 0.1 as suggested 
in Marra and Wood (2011), but results show that this choice is not always effective 
in removing artifacts if the estimated smoothing parameter is very small and overall 
penalization of the fit is weak. 

3. In a similar effort to avoid spurious estimates in scalar-on-function regression, James 
and Silverman (2005, their eq. (16)) suggested using the empirical FPCs of Al(s) 
scaled by their inverse eigenvalues as a penalty. This penalizes coefficient functions 
with large variability in directions in which Jf(s) varies very little or not at all (i.e., in 
ke(iF^)). We adapted this approach to our function-on-function setting by replacing 
the conventional difference operator based B-spline penalty matrix with 

min(A'',S') 

Ps = bJ ^ diag (w ■ Bg 

m=l 

with estimated FPCs and eigenvalues 4>m{s) and Um, rn = 1,..., min(A', S'). For a 

given to, this penalty matrix approximates the marginal penalty term to))‘^ds 

suggested by James and Silverman (2005). 

The empirical (pmis) and km have to be estimated from a singular value decomposition 
of X. It is unclear, however, how to compute the inverse eigenvalues if X is of low 
rank, i.e., if some of the km are (numerically) zero, which is of course precisely the 
setting in which this penalty might yield more stable estimates. In our experiments, 
we tried replacement of the zero eigenvalues with the smallest non-zero eigenvalue 
and a variety of other replacement schemes, but results from this approach seem to 
be fairly sensitive to both the chosen replacement scheme for zero eigenvalues and to 
the estimated FPCs themselves. 

4. Instead of the augmented or alternative penalization schemes for components of the 
coefficient surface that lie in the overlap ke(iF^) r\Vg±, we can also directly constrain 
these components to be zero. As these components are not estimable from the data, 
such constraints are a plausible default but need to be taken into account for the 
interpretation of the estimated surface. To implement them, we compute a basis 
of the overlap and constrain the coefficient surface evaluated on the observed grid 
to be orthogonal to vectors in the span of that basis. Let Xj_ = (X''^)_|_ and the 
S X Kg matrix Bg containing the marginal basis functions over S evaluated on s. A 
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basis spanning the overlap is then defined by the left singular vectors Vcs+ of the 
matrix Xj^ (JCj)"'" diag(m)S 5 Ps± that have positive singular values. 
For intuition, consider that if m = 1, the expression above projects BsPs±, i.e., a 
basis for Vs±, into the span of Xj, i.e, a basis for ke(iF"’^). This yields a basis for 
the intersection of the spans of BgPgx and Xj_. The expression above is cheap to 
compute in a numerically stable way using the QR decomposition of Xj. We then 

estimate 0 under the constraints dia.g{w)Bs& = 0. To give an example, the 
constrained coefficient surface estimate for a curve-wise centered functional covariate 
that does not contain any constant components will be centered around zero and 
not be offset in any direction if the penalty null-space includes constant functions. 
Note that, different from the augmented or alternative penalties described above, 
these constraints are empty if ke(X'’‘-) n Vs± is empty and penalties with constraints 
then reduce to the usual penalties without constraints. The constraint definition 
above is quite general and can be used for any basis with a quadratic penalty, it 
is not limited to the difference penalties we focus on. Specifically, it will also be 
applicable for derivative based penalties as well as some of the PDE-based penalties 
introduced by Ramsay et al. (2007) whenever identifiability issues arise. By default, 
the pffr function uses the diagnostic criterion (9) combined with a check for a rank- 
deficient design matrix to determine the presence of persistent non-identifiability. 
If persistent non-identifiability is found, a corresponding warning is issued and the 
constraints developed here are applied to the fit. Interpretation of the constrained 
coefficient surface estimates requires careful attention. For example, if constrained 
surface estimates are centered around zero because ke(X'’*') n Vs± contains constant 
functions, the signs of different regions of (3{s,t) are not interpretable due to the 
estimated absolute level of /3(s,t) being essentially arbitrary. 


4. Simulation study 

This section presents results on the practical consequences of the theoretical development 
in Section 3. Specifically, we investigate the performance of the tensor product spline- 
based approach given in Section 2 on artificial data of varying complexity and noise levels 
in terms of estimation accuracy and use the simulation results to validate the diagnos¬ 
tics for problematic settings we have developed. Subsequently, we present results for the 
modified full-rank penalties of Marra and Wood (2011), the FPC-based penalty of James 
and Silverman (2005) and the constrained estimates described in Section 3.3. All models 
were fitted with the pffrO function available in the refund package, which estimates the 
smoothing parameters using restricted maximum likelihood (REML). 

4.1. Simulation setup 

We simulate data from data generating process ( 1 ), with n = 50 subjects, T = 5 = [ 0 , 1 ] 
and S = 100 grid-points for Xi{s) = Y.m=i ^irn4’m{s). The effect surface f3{s,t) is esti¬ 
mated using tensor product cubic B-splines. We set Kt = 10 and use a marginal first order 
difference penalty for the t-direction. Test runs showed that results are insensitive to the 
number of grid-points for the response; we used T = 50 grid-points for Y{t). Twenty repli¬ 
cations were simulated for all sensible combinations of the following parameters (144000 
replicates in total). Eor the fitting algorithm, we vary 

• the marginal roughness penalties for the spline coefficients: either second order 
difference penalties (“A^”) or first order difference penalties (“A^”). Eor the second 
order differences, coefficient vectors in the penalty’s null-space ke(P(As,At)) param¬ 
eterize surfaces that are constant or linear in both directions. Eor the first order 
differences, coefficient vectors in ke(P(As,At)) parameterize constant surfaces. Note 
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that rank deficiencies can increase if associated smoothing parameters become suffi¬ 
ciently small. In that case, the penalty null-space effectively becomes the entire span 
of the associated basis functions. 

• and the number of basis functions over S: Kg G {5, 8,12}. 

For the data generating process, we vary the following parameters: 

• number of eigenfunctions for the X(s)-trajectories with non-zero eigenvalues: 
M G {3,5,8,12,20}. This means we have settings with M < Kg and M > Kg for 
most M. Note that the effective numerical rank of a simulated X can be (much) 
lower than M depending on the speed with which the eigenvalues decrease. 

• signal-to-noise ratio: SNR^ = ^ {2,10,1000}, where sd(x) is 

the empirical standard deviation of x. This corresponds to high and intermediate 
noise levels for realistic scenarios as well as settings with almost no noise to check the 
theoretical properties. 

• FPC systems for Xi(s) = with ^ N^(0,r'm)? with different 

patterns of decrease in the eigenvalues Um of the covariance operators: either a linear 
decrease {um = exponential decrease {um = exp(—Some 

of these processes are constructed so that their covariance operator kernels ke(iF^) 
include functions in the penalty null-space Vgx- 

— Poly: eigenfunctions are orthogonal polynomials of degree 0 to M — 1 with linear 
or exponentially decreasing eigenvalues {Poly,Lin and Poly,Exp, respectively). 
For Poly, ke(iF^) is disjunct from Vg±, since the first and second eigenfunctions 
are constant and linear polynomials. 

— Fourier: eigenfunctions are those of a standard Fourier basis. Although a com¬ 
plete Fourier basis is a basis for all square-integrable functions, in practice the 
kernel ke(iF"’^) of a truncated Fourier basis contains functions that are very close 
to the constant since no linear combination of a finite set of Fourier basis func¬ 
tions yields an exactly constant function, so ke(iF^) is not disjunct from Vgx. 
We used this basis with constant (t'm = 1 for Fourier, Const) or exponentially 
decreasing (for Fourier,Exp) eigenvalues Vm- 

— Wiener: eigenfunctions and eigenvalues are those of the standard Wiener process 
on [0,1]: 4>m{s) = \/2sin (7r(m — 0.5)s); Vrn = (f(2m-|-l)) The ke(A'^) is 
close to Vg±_ in this case as no linear combination of a finite set of these basis 
functions yields an exactly constant or linear function. 

— BrownBridge: eigenfunctions and eigenvalues are those of the standard Brownian 
bridge on [0,1]: 4>m{s) = \/2sin (yrms); Um = The ke(A''’*') is close to Vs± in 
this case as no linear combination of a finite set of these basis functions yields an 
exactly constant or linear function. 

— Poly(l+), Poly(2+), Poly(-l): eigenfunctions are orthogonal polynomials of de¬ 
gree 1 [2] to M [M -|- 1] for Poly(l+) [Poly(2+)], so that ke(Ar'’^) includes the 
(complete) null-space of the rank-deficient penalties, i.e., the constant [and linear] 
functions. Poly(-l) has polynomial eigenfunctions of degree {0, 2,3,4,..., M-|-1} 
so that ke{K^) overlaps the null-space of the second differences penalty but not 
the first differences penalty. All three processes are associated with linearly de¬ 
creasing I'm- 

From top to bottom, these processes become increasingly more “antagonistic” in the 
sense that 1) the kernels of these eigenfunction systems move increasingly closer to 
the kernels of the penalties we consider and 2) more quickly decreasing eigenvalues 
result in lower effective rank of the observed A(s). 

• coefficient functions /3(s,f) = Bg&Bj are drawn randomly for each setting. The 
associated coefficients are drawn as vec(0) N (o,{0.11 + P{Xs,\t)) ^'), where 
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P{Xs, Xt) is a first order difference tensor penalty matrix. 

— The marginal B-spline bases Bg and Bt have either 4 or 8 basis functions for each 
direction. 

— Xg = Xt are either 0.1 or 1. 

This generates coefficient surfaces of varying complexity and roughness. We do not 
fit models where the basis used to generate f3{s,t) is larger than that used to estimate 
I3{s,t) since that could introduce a distracting approximation error not relevant to 
the issues at hand. 


In order to make results comparable across the different settings, we use the relative inte¬ 
grated mean squared errors rlMSE^? = and rIMSEy = ^ 

where Yi{t) is the mean of Yi(t) over t. 


‘^dt 


4.2. Results 

Identifiability The estimation accuracy for Y{t) (not shown) is excellent across the board 
even for the very noisy settings, with 90% of relative integrated mean square errors below 
0.01 and a median of 0.00051. Estimation accuracy for /3(s,t), however, varies wildly over 
a range of 18 magnitudes between 1.3 x 10“® and 2.2 x 10^*^. Further analysis shows that 
the simulation study design succeeds in creating the identifiability issues described by the 
results in Section 3.2. To quantify the severity of identifiability issues, we compute rank 
correlations between rlMSE^ and rIMSEy over the 20 replicates of each simulation setting. 
As expected, we observe low or even negative correlations mostly for settings in which Dg 
is rank-deficient. This effect increases both for lower signal-to-noise ratios and for more 
complex true shapes of /3(s,t). For intuition, consider that the “best” solution for (8) 
for any given error will be the smoothest surface (i.e., the one with the smallest penalty 
term) in the set of surfaces that can be generated by adding functions from ke(Ar^) to 
any initial /3(s,t) with the given error. This may be quite close or quite far from the true 
/3(s,t), depending on the specific setting, with more noisy data and more complex true 
shapes more likely to result in fits that are quite far from the truth and still producing 
good model fit. 


Estimation performance for I3{s,t) Figure 2 shows the estimation errors for coefficient 
surfaces generated with SNR^ = 10. The right column shows results for numerically rank 
deficient Dg (i.e., condition number k{DJD g) > 10®), the left column for designs with 
K {DjDs) < 10®. The top row shows results for first differences penalty, bottom row 
for second differences penalty. Boxplots are grouped by the amount of overlap between 
ke(Ar^) and Vg± as computed by C\x^'Psi_ color-coded for the different processes 

the A(s)-trajectories are sampled from. Results for SNR^ = 2 and 10® were qualitatively 
very similar - errors obviously become larger for noisier data but the pattern shown in 
Figure 2 remains the same. Note that relative estimation errors below ~ 0.01 correspond 
to estimates that are visually indistinguishable from the true surfaces, and that errors below 
~ 0.1 (thick black horizontal line) usually preserve most essential features of the true /3(s, t) 
well. Results with rIMSEy 3 > 1 bear little resemblance to the “true” function. Also recall 
that all of these fits, with rlMSE^ values varying by more than 14 magnitudes, resulted 
in a comparatively small range of rIMSEy between 10“® and 10“®. Closer inspection of 
results shows that the extremely large errors for Poly(l+), Poly(-l) and Poly(2+) are 
caused by the expected behavior: estimates are shifted by functions from the overlap of 
ke(Ar^) and Vg_i_. The top row of Figure 4 shows an example of this behavior: the estimate 
for first order difference penalty is shifted by a large constant, while the estimate for the 
second order difference penalty is shifted by both a constant and a huge linear trend in 
s-direction. The htted values of all models shown in Figure 4 are practically identical. 
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Figure 2: Boxplots for relative integrated mean square error rIMSEp for all 18000 results for SNR^^ = 10 with 
conventional difference penalties. Columns show results for full rank Ds versus numerically rank 
deficient Ds ■ Rows show results for the first and second order difference penalties. Boxplots grouped 
by overlap between ke{K^) and Vs± as computed by 0 x^ 73 ,^ ^ color-coded for the different processes 
the X(s) are sampled from. Vertical axis on log^g-scale, extreme errors > 10® for Poly(2-|-) are cut 
off. 


These results mostly corroborate the results derived in Section 3 - we see that: 

• Serious errors rlMSE^ >0.1 are rare for both full-rank and rank-deficient Dg if the 
generating process for X(s) is not antagonistic in the sense that \ie{K^)r\Vsy = {0}, 
i.e. for the Poly, Fourier, Wiener and BrownBridge processes. 

• As long as ke(Ar^) n Vgi. = {0} (approximated numerically by the criterion that 
V\x^Vsi_ ^ 0.95), regularization of the estimated coefficient surface allows us to 
achieve good estimates even if the unpenalized regression model per se would not 
be identifiable due to rank deficiency of Dg. 

• The larger Vg± (top to bottom), and the closer ke(Ar^) is to Vg± (left to right in each 
group of boxplots), the larger the likelihood of estimates far from the truth and the 
larger the average rlMSE^. 

Diagnostics As in Figure 2, we use ^ measure of the overlap between 

ke(Ar'’‘-) and Vg±. We consider a replicate to be “flagged” as problematic if both — 

0.95 and K{Dj,Dg) > 10®. 

Figure 3 shows a mosaic plot of the contingency table of “flagged"’ replicates and catego¬ 
rized rlMSE^. While the sensitivity for identifying replicates with rlMSE^ > 1 is 0.92, the 
specificity is only 0.28. The sensitivity for identifying replicates with rlMSE^ > 0.1 is 0.58, 
the specificity is 0.09. These fairly low specificities indicate that the penalized approach to 
function-on-function regression discussed here can outperform theoretical expectations and 
frequently finds good solutions even in very difficult settings. Total accuracy for identify¬ 
ing settings with rlMSE^ > 0.1 is 0.88 and 0.94 for rlMSE^ > 1. The positive predictive 
value (precision) of the criterion for rlMSE^ > 1 is 0.72, while it is 0.91 for rlMSE^g > 0.1. 
The negative predictive value of the criterion for rlMSE^ > 1 is 0.99, while it is 0.87 
for rlMSE^ >0.1. Also note that certainly not all errors in the (0.1, l]-range are due to 
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Figure 3: Mosaic plot for contingency table of ’’Bagged” replicates and and categorized rlMSEp. 


identifiability issues, so not all of the (rare) non-detections are failures of the criterion. 

Performance of modified penalties We broadened the scope of our simulation study 
by additionally comparing the performance of the non-standard penalties and constraints 
introduced in Section 3.3: 

• a full-rank ridge penalty (“A*^”), 

• the modified full-rank roughness penalties as suggested by Marra and Wood (2011); in 
our case we used both full-rank first order differences penalties (“A^”) and full-rank 
second order differences penalties (“A^”), 

• the FPC-based penalty of James and Silverman (2005) (“ke(iF^) (FAME)”). We 
replace Vm by max(z>m) 10“^^Pi) in order to remove any (numerically) zero or negative 
eigenvalues, 

• and conventional first and second order differences penalties with additional con¬ 

straints (“A^ +C‘\ “A^ -1-^“) that enforce orthogonality of the estimated coefficient 
surface to functions in ke(A'^) n Vs± if the data are flagged as problematic by our 
diagnostic criterion {k{DJDs ) > 10® and C\xj^v ±_ ^ Note that A^ -|- C and 

A^ -|- C reduce to A^ and A^, respectively, if ]se{K^) n Vsi. = 0 and need not be 
compared to the other techniques in that case. 

Figure 5 shows the rlMSE^ for SNR^ = 10 for the different A(s)-processes and penalties. 
Note that, in contrast to Eigure 2, colors now represent the different penalties, not the 
different A(s)-processes. Boxplots for A^ and A^ contain the same results as those shown 
in Figure 2. 

The full-rank difference penalties A^ and A^ (cyan and turquoise boxes) seem to dras¬ 
tically reduce the size and likelihood of severe estimation errors in the difficult settings, 
especially compared to A^ (dark blue). We occasionally incur slightly worse estimates for 
the easier settings, but these differences are small and hardly relevant here. However, we 
have found that neither A^ nor A^ with e = .1 are effective in removing non-identihability 
artifacts if the estimated smoothing parameter is very small, i.e., if the effect of the penalty 
on the fit is weak. This explains the large outliers observed for some replicates for the an¬ 
tagonistic A-processes (top row) for A^ and A^. Figure 8 (bottom row, leftmost two 
panels) shows this for the dti data set, where the estimated smoothing parameters are 
small. Note that no such outliers occur for the ridge (A®) and FAME penalties (green 
boxes). However, both penalties are not competitive for numerically rank deficient Dg for 
the non-pathological A(s)-processes in the bottom two rows and perform worse than the 
(modified) difference penalties in all other settings, and so cannot be recommended for 
general use as well. Furthermore, in the simulated settings we use, the true /3(s,t) are 
centered around 0, which reduces the negative effect of these penalties’ bias towards small 
absolute values of f3{s,t). 
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Figure 4: Example estimates for different penalties in a very difficult setting with X{s) from Poly(2+), M = 5, 
Kg = 12, SNRg = 2. Center left panel: true coefficient surface. Top row, left to right: first order 
difference penalty, second order difference penalty, ridge penalty, FAME penalty. Bottom row, left to 
right: full-rank first and second order difference penalties, first and second order difference penalties 
with constraints. Note the different z-axis scales in the two left panels of the top row. Subtitles give 
rIMSEp and rIMSEy for each fit. 
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Figure 6 shows results only for simulated data sets flagged as problematic by our diag¬ 
nostic criterion for or A^. In such settings, it makes sense to use difference penalties 
combined with ’’orthogonal-to-null-space-overlap” constraints on the fitted surface, these 
results are denoted by A^ -|- C or A^ -|- C (violet and purple), respectively. 
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Figure 5: rIMSEp for SNR^ = 10 for the 8 X{s)-processes (panels) and the 8 different penalties (color). 

Separate boxplots in each panel for settings with numerically rank deficient Dg (k{DJDs) > 10®) 
versus settings with full rank Dg. 


Figure 4 shows illustrative exemplary fits for the modihed penalties and penalties with 
constraints in a data setting with k{DJDs) > 10®. in the two rightmost panel of the top 
row and the panels in the bottom row. While rIMSEy is very similar for all 8 penalties 
except FAME (top right), the shapes of the corresponding coefficient surface estimates 
differ from each other in the expected fashion: The ridge penalty’s bias towards small 
|/3(s,t)| and tendency to under-smooth the estimated surface is visible, as is the typical 
wiggliness of FAME fits which are also not roughness penalized in the conventional sense. 
Compared to the result for A^, the fit for A^ is much closer to the level of the true /3(s, t) 
and adds a much smaller (spurious) constant to the fit. Similar remarks apply for the 
comparison between A^ and A^: The huge spurious linear trend in s is almost completely 
removed. The overlap criterion flxx'P ^ is ~ 1 for A^, ~ 2 for A^ and 0 by construction 
for all others. Both constrained fits (A^ + C, A^-fC, two bottom right panels) completely 
suppress the spurious constant (and trend) present in the unconstrained fits (two top right 
panels), and A^ -|- C yields a slightly smoother fit than A^ -|- C as expected. 

Note that the trne (3{s,t) are centered around 0 in our simulation. As all modified 
penalties and penalties with additional constraints result in estimated surfaces around 
0, results will be shifted up- or downwards compared to the true surfaces if this is not 
the case in a real application. Such a shift reflects the fact that the average level of the 
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Figure 6: rIMSEp for SNR^ = 10 only for “pathological” data-sets. X(s)-processes in panels, penalties coded 
by color. 


surface cannot be inferred from the data in settings with corresponding persistent non- 
identifiability. Thus, in cases where our diagnostics indicate persistent non-identifiability, 
estimated coefficient surface values can only be interpreted relative to one another, but no 
interpretation of their sign is possible due to the estimated absolute level being essentially 
arbitrary. A corresponding implication regarding linear trends applies to the case of second 
order difference penalties. 

Summary of simulation results We can draw the following conclusions based on the 
entirety of simulation results: 

• the potential for extreme estimation errors is large for A(s)-processes with low effec¬ 
tive rank whose ke{K^) is not disjunct from Vs±- 

• there is no strong positive correlation between accuracy of fitted values (rIMSEy) and 
accuracy of the estimated coefficient surface (rlMSE^) for rank deficient Dg. Even 
extremely wrong estimates of /3{s,t) can yield good model fits. 

• calculating yields a suitable criterion that can diagnose per¬ 

sistent non-identifiability reliably, albeit with a substantial rate of ’’false alarms” in 
which conventional difference penalties work well despite an antagonistic data situa¬ 
tion. 

• the full-rank roughness penalties often stabilize estimates in persistent non-identifiability 
settings without diminishing accuracy in other settings by a relevant amount, but not 
reliably so. 

• for problematic data settings flagged by the criterion developed here, the combination 
of difference penalties with suitable constraints on the coefficient surface (”A -\- C”), 
ridge penalties (”A*^”) and penalties based on the spectrum of the functional covari¬ 
ate (”EAME”) all perform similarly and outperform both full-rank and conventional 
difference penalties. 


5. Case Study 

To illustrate the practical relevance of the theoretical development given in Section 3 
and the performance of the countermeasures developed therein, we describe a deliberately 
constructed but realistic setting in which non-identifiability persists despite penalization. 
We use a subset of 100 patients from the dti-data described in the introduction and fit a 
realistic model in which investigators try to separate the effect of mean CCA-FA levels on 
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RCST-FA Yi{t) 


RCST-FA estimates Yi{t) 


CCA-FA A,(s) (6 FPCs) 



functional intercept /3o{t) 


functional effect of Xi{s) 


X^(s)i3(s, t)ds 
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t 


centered CCA-FA Af(s) {6 FPCs) 



Figure 7: Top row, left to right: Observed RCST-FA Yi(t); estimated RCST-FA Yi(t) for the model described 
in this Section; presmoothed CCA-FA based on 6 largest FPCs as used in the example shown in 
Figure 1; presmoothed, curve-wise centered CCA-FA based on 6 largest FPCs as used in the example 
below. Bottom row, left to right: Estimated functional intercept /3o(t) and estimated functional 
effect of mean CCA-FA Pi{t), with approximate point-wise 95% conhdence intervals; estimated 
contributions of the functional effect X[(s)$(s,t)ds to the additive predictor. 



Figure 8: Estimated coefhcient surfaces ${s, t) for regressing RCST-FA on mean CCA-FA and centered CCA- 
FA truncated to its first 6 empirical FPCs. All 8 fits lead to practically identical fitted values. 
Top row from left: First order (A^), second order difference penalty (A'^), ridge penalty (A'^), FPC- 
kernel-based penalty (FAME). Bottom row from left: Full rank first order (M-W)), second order 
difference penalty ('A^ (M-W)), first order with constraint on kernel overlap (A^ + C), second order 
with constraint on kernel overlap (A'^ + C). All fits performed with pffrO. Note that z-axis limits 
for the 4 panels on the left differ from each other and from those of the 4 panels on the right. 


RCST-FA from that of the shape of CCA-FA along the tract. Such a model can be defined 
as 


Yi{t) = j3o{t) + Xi(3i{t) + Xf{s)l3(s,t)ds + eu] N{0,a‘^), 


S 
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where Yi{t) is RCST-FA for patient i, /3o(^) is a global functional intercept, Xi = Xi(s) 
are the mean CCA-FA levels with associated effect /3i(t), and Xf{s) ^ Xi{s) — Xi are the 
denoised, curve-wise centered CCA-FA curves. In this example, we use a reconstruction 
A?(s) based on the six largest empirical FPCs (ca. 90 % of total variance explained) to 
denoise CCA-FA and interpolate missing values. Note that curve-specific mean centering 
causes all constant functions to be in the kernel of the covariance of Xf(t), i.e. x — 

.95 for first and second order difference penalties. All fits shown below use Kg = Kt = 12 
marginal basis functions, so by construction we expect k{DJDs) to be large since the rank 
of the functional covariate is at most M = 6. More than 6 marginal basis functions are 
required, however, so that the basis for /3(s,t) is flexible enough for approximating the 
rather complex shape we observe here. Note that curve-wise centering is also a necessary 
pre-processing step for many spectroscopic data analyses where absorption spectra are 
used as functional covariates and different mean intensity levels of such spectra are often 
pure laboratory artifacts (Fuchs et ah, 2015), and more generally in settings such as this 
one where practitioners try to separate effects of the mean level of a functional covariate 
from those of its shape. 

From left to right, the top row of Figure 7 shows the responses and htted values for 
this model, the uncentered functional covariates used for the models shown in Figure 1, 
and the centered functional covariates used for the models in this section. For uncentered 
covariates, this setting provides an example of simple non-identifiability, while centered 
covariates induce persistent non-identifiability for difference penalties. Figure 8 shows the 
estimated coefficient surfaces /3{s,t) for this model for various penalties. Despite the di¬ 
vergent shapes of the different coefficient surface estimates, all 8 fits lead to practically 
identical fitted values on the training data (100 patients) and practically identical pre¬ 
dictions for the test set (155 patients), with integrated MSE 0.0031 and predictive 
integrated MSE ss 0.0046 for each model. Both /3o{t) and /3i(t) are also practically identi¬ 
cal for all 8 models, they are shown in the bottom row of Eigure 7 along with the estimated 
contributions Xf{s)/3{s,t)ds of the functional covariate to the additive predictor. 

The model specification is appropriately flagged as causing persistent non-identifiability 
(i.e., (lx, = .98 [1.15] and k{DJD g ) —oo) for first [second] order difference penalties 
[A^]. It is instructive to compare the resulting estimates for /3(s, t) for different penalty 
specifications: the first difference penalty fit (top, left) includes a constant offset from 
0, while the second difference penalty fit (top, second from left) includes different offsets 
from 0 for each t, in a linear decrease that is unpenalized by the marginal second order 
difference penalty in t direction, as well as a linear trend in s. The presence of this 
spurious linear trend is surprising since we did not explicitly remove linear components 
from the functional covariate and it is not present in the fits for the uncentered data (c.f. 
Eigure 1). Eurther investigation revealed that the centered functional covariate’s null-space 
contains a fairly strong linear component ((X'^) J, w-s) = 0.77, where (A1‘’)J denotes 

the orthogonal complement of the observed X'^(s)), which is overlap enough to produce 
spurious linear shifts in this example. Note that the linear component is much stronger 
in the non-centered covariates ([^ 2 .v((^)I,= 0.11) and its lack is caused by the 
curve-wise centering in this example. It is straightforward to show that this can occur if 
f Xi(s)sds ^ Xi f sds Vi = 1,..., n, as is the case here. 

Both offset and trends are reduced, but not entirely removed, by instead using the full- 
rank difference penalties with e = 0.1 (bottom row, two leftmost panels). Due to the 
wiggliness of the coefficient surface, the estimated smoothing parameters are quite small, 
so there is only little penalization going on. Consequently, the additional small penalty on 
Vg± is not strong enough to eliminate non-identifiability artifacts from the fit in this case. 
Estimated coefficient surfaces with full-rank difference penalties with e = 1 (not shown), 
however, are very similar to the those in the four right-most panels of Eigure 8. That 
results for the full-rank difference penalties depend so strongly on this tuning parameter 
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is a considerable disadvantage. 

The four right-most panels show results based on the ridge penalty (A®, second from 
right, top row), the penalty based on the FPCs of 'ke^K^) (FAME, top right), and the 
difference penalties with additional “orthogonal-to-kernel-overlap” constraints (A-j-C, bot¬ 
tom row). It is reassuring to see that all four of these penalties lead to very similar results in 
this setting despite their different motivations and mathematical properties. Admittedly, 
understanding the estimated coefficient surfaces in terms of the supposed data generating 
process remains difficult. First, because of their complex shape and secondly (and more 
pertinently to the main points of this paper), because the restriction to surfaces centered 
around 0 that is enforced for IS. + C and implied by the A^ and ke(Ar^) penalties pre¬ 
cludes facile interpretation of, e.g., the sign of features of /3(s,t). Since the “true” offset 
of the coefficient surface is not estimable from the data, negativity or positivity of certain 
peaks or troughs of /3(s,t) is not directly interpretable and interpretation can thus only 
rely on relative heights across the surface. Higher dimensional overlaps between penalty 
and covariate null-spaces than the one encountered here will compound these expositional 
difficulties. 

This example demonstrates how reasonable, but unfortunate combinations of data pre¬ 
processing and model specifications can lead to simply as well as persistently non-identifiable 
models despite penalization. The diagnostics and countermeasures developed in Section 
3, however, seem to be suitable for detecting and remedying such problems in a real data 
setting. 

6. Conclusion and Discussion 

Coefficient surface estimates in spline-based function-on-function-regression (1) can suffer 
from persistent identifiability problems if the span of the marginal basis for the coefficient 
surface over a functional covariate’s domain overlaps the kernel of its covariance operator. 
A rank deficient design matrix can occur in particular if the functional covariate’s covari¬ 
ance operator is of effective rank smaller than the number of marginal basis functions - 
either because the number of eigenfunctions with non-zero eigenvalues is truly below the 
number of marginal basis functions, or if eigenvalues of the covariance operator decrease 
too rapidly compared to the noise level of the data. 

In practice, spline based approaches are typically fitted with a regularization penalty 
corresponding to a smoothness assumption on the coefficient surface. We have shown 
that identifiability problems persist if, and only if, in addition to a rank deficiency of 
the design matrix, the kernel of the functional predictor’s covariance ke(iF^) overlaps 
the function space Vsi. spanned by parameter vectors in the null-space of the spline’s 
roughness penalty. In the case of no overlap, there is a unique smoothest representative 
on any hyperplane of parameter vectors leading to the same additive predictor and the 
penalized estimation problem finds a unique smoothest solution. Similar results hold for 
the simpler case of penalized scalar-on-function regression models (Happ, 2013). They are 
also expected to hold for more general loss functions than the quadratic loss analysed here 
since generalized additive models are typically estimated by the penalized iteratively re¬ 
weighted least squares (P-IWLS) method, where the system of equations solved in each step 
is identical to that of (8) except for the introduction of a vector of IWLS weights and the 
substitution of y by IWLS working responses that are element-wise linear transformations 
of y. 

A lack of identifiability also implies a lack of correlation between accuracy of the co¬ 
efficient estimates and goodness of ht for the responses. As this extends to prediction 
errors for out-of-sample data from the same process, it is usually not possible to detect 
identifiability issues for a given data set based on subsampling or cross-validation schemes. 
Instead, based on theoretical considerations and simulation results, we have identified two 
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easily computable diagnostic criteria in order to detect non-identifiable model specifica¬ 
tions before estimation. The criteria combine the condition number of a partial design 
matrix with a measure of the amount of overlap between the kernel of the functional pre¬ 
dictor’s covariance and the null-space of the penalty. Non-identifiability may in particular 
be an issue if both criteria are indicative of a problematic setting, or if the partial design 
matrix is numerically rank deficient and the penalty smoothing parameter is estimated 
to be close to zero. If a non-identifiable model specification is discovered, we recommend 
that practitioners choose modified full-rank roughness penalties to safeguard against spu¬ 
rious estimates or estimate coefficient surfaces under constraints that force these spurious 
components to zero. The pffr-function in the refund package uses a first order differ¬ 
ences penalty by default, incorporates the diagnostic checks developed and evaluated in 
this work, and issues corresponding user warnings and enforces suitable constraints if a 
persistently non-identifiable model specification is detected. 

Another practical consequence of our results is that pre-processing methods for func¬ 
tional covariates should be avoided if they reduce their effective rank (such as pre-smoothing 
with low-dimensional bases) or if they increase the amount of overlap between ke(Ar^) and 
Vs± (such as curve-wise centering). 

Jointly, these provisions seem to be sufficient to diagnose and safeguard against most 
serious artifacts of non-identifiability in practice. Our results indicate that in many cases, 
penalization allows the reasonable estimation of coefficient surfaces that are not identifiable 
in the theoretical model under an additional smoothness assumption, avoiding instead the 
common assumption that the estimated coefficient surface lies in the span of the first few 
eigenfunctions of the covariance operator of the covariate. 

This work drives home the point that we cannot hope to reliably estimate arbitrarily 
complex effect shapes from functional covariates with low information content. In that 
sense, assuming smoothness of the coefficient surface and constraining its non-identifiable 
components to be zero is simply following a principle of parsimony. At the same time, sub¬ 
stantial interpretation of coefficient surface estimates derived from rank-deficient designs 
is difficult and has the potential to be very misleading. Functional principal component 
regression approaches do not suffer from the potential identifiability issues discussed here, 
but they do so at the price of restricting the estimated coefficient surface to the span of the 
estimated functional principal components. These are significant challenges for the matur¬ 
ing field of functional regression methods, at least in applications where these methods are 
used not only for prediction, but also for inferring and understanding the underlying data 
generating processes. It is our hope that the theoretical development and practical exam¬ 
ples presented here can serve as a starting point for critical reflection on this important 
issue. 
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A. Proofs 


A.l. Proof of Proposition 3.2 

Assume that Bt is of full rank Kf. Then, the design matrix D = Bt® (XWBg) in model 
(7) is rank-deficient if and only if 

a) M < Kg or 

b) if M > Kg, but lanki^WBg) < Kg . 

Proof. As rank(X) = rank(H$) = rank($) = M, 

rank($l^S^) > rank(H$l^B«) = ia-ak[XWBg) 

> rank(H$) -|- rank($VFSs) — rank($) = rank($VF-Bs) 

using Harville (1997, Th. 17.5.1). Thus, rank(XVFBs) = rank($kl^Ss). The rank will 
be less than full if rank($'H^Ss) < Kg, including if M < Kg, as rank($kl^Ss) < M by 
construction. □ 


A.2. Proof of Proposition 3.3 

Let P = \g{lKt®Ps)+^t{Pt®lKs)i with Pg and Pt positive semi-definite matrices. Assume 
that Bt is of full rank Kt, that rank(i^t) < Kt and that As > 0, At > 0. Then, for any 
f G im(iA) there is a unique minimum PO] if and only if ke(iAjDs)nke(i-’s) = 

{ 0 }. 

Proof. We have 


min{0^P0} 


= mm{e^UU'^PUU'^e} 

= min{{e^u+\e^Uo) (^ujput 


ulPUo\ (uje\ 
UjPUoJ 


s.t. 0 £PLf 
s.t. DO = f 


s.t. UjO = f. 


Denote u_|_ = U^O and vq = UqO, with = U^O a bijective re-parametrization 

of 0. Note that is fixed while vq is free to vary within the hyperplane. Setting the 
derivative with respect to vq equal to zero yields 

PUoVo = -Ufl PU+V+. (10) 


Now, if ke(ZlJUs) H ke(Ps) / {0}, choose Ug 0 with UgoUg G ke(iA7Zls) H ke(Ps) and 
Ut ^ 0 with UtUt G ke(Pt). As Uq = Ut ® Ugo, we have 

{uj ®uJ)UjPUo{ut®Ug) = Xg{uJut)uj UjoPgUsoUg -7 XtuJ Uj PtUtUt{u] Ug) = 0. 


Thus, Uq PUq is not of full rank, no unique solution vq of (10) exists, so there is no unique 
minimum P9}. 

On the other hand, if ke(iAjZls) O ke(Ps) = {0}, for any x with x~^UjQPgUsox = 0 
we have Usqx G ke(iAjiAs) O ke(Ps) = {0} and thus x = UJqUsqx = 0. As this means 
that UJqPsUso is positive definite, we also have that PUq = Xglxt ® (UjoPsUgo) + 
Xt{u;^PtUt) ® I{Ks-d/Kt) is positive dehnite and thus invertible. Therefore, there is a 
unique minimum vq = —{UjPU o)~^Uq PU.^.v^ and a unique smoothest point 

Of = Uivl,-[{u;iPUo)-^U;iPU+v+]'^)'^ = HU+v+ = H0, (11) 

with H = (iKsKt - Uo{UjPUo)-^UjP) and OjPOf = min^g^^ O^PO. □ 
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A.3. Proof of Proposition 3.4 

Assume that Bt is of full rank Kt, that rank(Pt) < Kt and that > 0, Ai > 0. Then, there 
is a unique penalized least squares solution for ( 8 ) if and only if ke(iAj Ds)r]ke{Ps) = {0}. 

Proof. Problem ( 8 ) has a unique solution iff {D^D + \s{lKt ® -Ps) + M{Pt ® Iks)) > 0 is 
invertible, i.e., positive definite. Now, suppose that ke(iAjiAs) n ke(Ps) = {0}. For any 
X G with 


x~^{D^D + \s{lKt ® Ps) + \{Pt ® Iks))x = 0, 
we have, with h = x = {bkj)kj£{ii,i2,...,KtKs}: 

= 6^ (e? ® 0 ))'> = » ,12) 

=> bkj = Oy I < j < d/Ku I < k < Kt, 


and also 

b'^ilK, ® {{Us+\UsoVPs{Us+\Uso))]b b^{lK, ® iUjoPsUso)]b = 0, 

where b is obtained by removing the zero entries given by ( 12 ) which correspond to Us+ 
from b. Thus, for all 1 < A: < Kt, and letting bk = {bk{d/Kt+i)^ ■ ■ ■ ^bkKa), we have 
Usobk £ ke{DjDg) n ke(Pj) = {0}. Thus, 6 = 0, 6 = 0, a; = 0 and {D^D + Xs{lKt ® 
Ps) + \t{Pt ® Iks)) is of full rank. 

On the other hand, if ke(iAj Dg) H ke(Pi) 7 ^ {0}, there is a 7 ^ 0 with DgUg = 0 and 
PgUg = 0. Choose 0 ut € keP*. Then, 

(uj (g) uJ){D^D + Xg{lKt ® Ps) + M{Pt ® lKs)){ut ® Ug) 

= uJ bJB tUt ■ 0 + Xgul • 0 + At • 0 • ItJiis = 0. 

Thus, {D^D + As {J-Kt ® Ps) + ^t{Pt ® J-Ks)) is singular and not invertible. D 
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